1,987 research outputs found

    Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment

    The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., the compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host phenotypes based on taxonomy-informed feature selection, to establish an association between the microbiome and disease states, is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases, and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here are more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. The manual identification of data sources has been complemented with: (1) an automated publication search through digital libraries of the three major publishers using a natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on a learning-to-rank approach.
    Funding: COST Action CA18131; Cierva Grant IJC2019-042188-I (LM-Z); Estonian Research Council grant PUT 1371.

    Robust Identification of Target Genes and Outliers in Triple-negative Breast Cancer Data

    Correct classification of breast cancer subtypes is of high importance as it directly affects the therapeutic options. We focus on triple-negative breast cancer (TNBC), which has the worst prognosis among breast cancer types. Using cutting-edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma (BRCA) transcriptomic data publicly available from The Cancer Genome Atlas (TCGA) data portal. Our analysis identifies statistical outliers that may correspond to misdiagnosed patients. Furthermore, it is illustrated that classical statistical methods may fail in the presence of these outliers, prompting the need for robust statistics. Using robust sparse logistic regression we obtain 36 relevant genes, of which ca. 60% have been previously reported as biologically relevant to TNBC, reinforcing the validity of the method. The remaining 14 genes identified are new potential biomarkers for TNBC. Out of these, JAM3, SFT2D2 and PAPSS1 were previously associated with breast tumors or other types of cancer. The relevance of these genes is confirmed by the new DetectDeviatingCells (DDC) outlier detection technique. A comparison of gene networks on the selected genes showed significant differences between TNBC and non-TNBC data. The individual role of FOXA1 in TNBC and non-TNBC, and the strong FOXA1-AGR2 connection in TNBC stand out. Not only will our results contribute to the understanding of breast cancer/TNBC and ultimately its management, they also show that robust regression and outlier detection constitute key strategies to cope with high-dimensional clinical data such as omics data.

    Classification and biomarker selection in lower-grade glioma using robust sparse logistic regression applied to RNA-seq data

    Effective diagnosis and treatment in cancer is a barrier for the development of personalized medicine, mostly due to tumor heterogeneity. In the particular case of gliomas, highly heterogeneous brain tumors at the histological, cellular and molecular levels that exhibit poor prognosis, the mechanisms behind tumor heterogeneity and progression remain poorly understood. The recent advances in biomedical high-throughput technologies have allowed the generation of large amounts of molecular information from patients that, combined with statistical and machine learning techniques, can be used for the definition of glioma subtypes and targeted therapies, an invaluable contribution to disease understanding and effective management. In this work, sparse and robust sparse logistic regression models with the elastic net penalty were applied to glioma RNA-seq data from The Cancer Genome Atlas (TCGA), to identify relevant transcriptomic features in the separation between lower-grade glioma (LGG) subtypes and to identify putative outlying observations. In general, all classification models yielded good accuracies, selecting different sets of genes. Among the genes selected by the models, TXNDC12, TOMM20, PKIA, CARD8 and TAF12 have been reported as genes with a relevant role in glioma development and progression. This highlights the suitability of the present approach to disclose relevant genes and fosters the biological validation of non-reported genes.
    Funding Information: This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references CEECINST/00102/2018, UIDB/00297/2020 and UIDB/00297/2020 (NOVA MATH, Center for Mathematics and Applications), UIDB/04516/2020 (NOVA LINCS), and the research project “MONET – Multi-omic networks in gliomas” (PTDC/CCI-BIO/4180/2020). The results presented are based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. Publisher Copyright: © Brazilian Journal of Biometrics.

    Twiner: correlation-based regularization for identifying common cancer gene signatures

    Background: Breast and prostate cancers are typical examples of hormone-dependent cancers, showing remarkable similarities at the level of hormone-related signaling pathways, and exhibiting a high tropism to bone. While the identification of genes playing a specific role in each cancer type brings invaluable insights for gene therapy research by targeting disease-specific cell functions not accounted for so far, identifying a gene signature common to breast and prostate cancers could unravel new targets to tackle shared hormone-dependent disease features, like bone relapse. This would potentially allow the development of new targeted therapies directed to genes regulating both cancer types, with a consequent positive impact on cancer management and health economics.
    Results: We address the challenge of extracting gene signatures from transcriptomic data of prostate adenocarcinoma (PRAD) and breast invasive carcinoma (BRCA) samples, particularly estrogen positive (ER+), and androgen positive (AR+) triple-negative breast cancer (TNBC), using sparse logistic regression. The introduction of gene network information based on the distances between BRCA and PRAD correlation matrices is investigated, through the proposed twin networks recovery (twiner) penalty, as a strategy to ensure that gene features similarly correlated in the two diseases are less penalized during the feature selection procedure.
    Conclusions: Our analysis led to the identification of genes that show a similar correlation pattern in BRCA and PRAD transcriptomic data, are selected as key players in the classification of breast and prostate samples into ER+ BRCA/AR+ TNBC/PRAD tumor and normal tissues, and are also associated with survival time distributions. The results obtained are supported by the literature and are expected to unveil the similarities between the diseases, disclose common disease biomarkers, and help in the definition of new strategies for more effective therapies.
    Funding: This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references UID/EEA/50008/2019 (Instituto de Telecomunicações), UID/CEC/50021/2019 (INESC-ID), UID/EMS/50022/2019 (IDMEC, LAETA), PREDICT (PTDC/CCI-CIF/29877/2017), and PERSEIDS (PTDC/EMS-SIS/0642/2014).
    © The Author(s) 2019. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

    Tcox: Correlation-based regularization applied to colorectal cancer survival data

    Colorectal cancer (CRC) is one of the leading causes of mortality and morbidity in the world. Because it is a heterogeneous disease, its therapy and prognosis represent a significant challenge to medical care. Molecular information improves the accuracy with which patients are classified and treated, since similar pathologies may show different clinical outcomes and different responses to treatment. However, the high dimensionality of gene expression data makes the selection of novel genes a problematic task. We propose TCox, a novel penalization function for Cox models, which promotes the selection of genes that have distinct correlation patterns in normal vs. tumor tissues. We compare TCox to other regularized survival models: Elastic Net, HubCox, and OrphanCox. Gene expression and clinical data of CRC and normal (TCGA) patients are used for model evaluation. Each model is tested 100 times. Within a specific run, eighteen of the features selected by TCox are also selected by the other survival regression models tested, therefore undoubtedly being crucial players in the survival of colorectal cancer patients. Moreover, the TCox model exclusively selects genes able to categorize patients into significant risk groups. Our work demonstrates the ability of the proposed weighted regularizer TCox to disclose novel molecular drivers in CRC survival by accounting for correlation-based network information from both tumor and normal tissue. The results presented support the relevance of network information for biomarker identification in high-dimensional gene expression data and foster new directions for the development of network-based feature selection methods in precision oncology.
    Funding: This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references PD/BD/139146/2018, IF/00409/2014, UIDB/50021/2020 (INESC-ID), UIDB/50022/2020 (IDMEC), UIDB/04516/2020 (NOVA LINCS), and UIDB/00297/2020 (CMA) and projects PREDICT (PTDC/CCI-CIF/29877/2017) and MATISSE (DSAIPA/DS/0026/2019).

    RObust Sparse ensemble for outlIEr detection and gene selection in cancer omics data

    The extraction of novel information from omics data is a challenging task, in particular because the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can largely impact classification accuracy. Here we introduce ROSIE, an ensemble classification approach, which combines three sparse and robust classification methods for outlier detection and feature selection and further performs a bootstrap-based validity check. Outliers of ROSIE are determined by the rank product test using outlier rankings of all three methods, and important features are those commonly selected by all methods. We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of (Formula presented.) genes and more than (Formula presented.) samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the genes commonly selected by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is ad hoc applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.
    Acknowledgments and funding: We thank Peter Segaert for providing his adapted code of the enetLTS method. The results presented here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. Funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2075 390740016 (AJ and NR). Publisher Copyright: © The Author(s) 2022.
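
    The rank-aggregation idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the ROSIE implementation: the sample identifiers and ranks below are hypothetical, the function name `rank_product` is an assumption, and only the rank product statistic (the geometric mean of a sample's ranks across methods) is computed, without the significance test the abstract refers to.

```python
import math

def rank_product(rankings):
    """Geometric mean of each sample's rank across several methods.

    rankings: list of dicts mapping sample id -> rank (1 = most outlying).
    All dicts are assumed to share the same set of sample ids.
    """
    k = len(rankings)
    samples = rankings[0].keys()
    return {s: math.prod(r[s] for r in rankings) ** (1.0 / k) for s in samples}

# Hypothetical outlier rankings produced by three different methods.
method_a = {"s1": 1, "s2": 2, "s3": 3, "s4": 4}
method_b = {"s1": 2, "s2": 1, "s3": 4, "s4": 3}
method_c = {"s1": 1, "s2": 3, "s3": 2, "s4": 4}

scores = rank_product([method_a, method_b, method_c])

# Samples with the smallest rank product are the most consistently outlying.
ordered = sorted(scores, key=scores.get)
print(ordered)
```

    A low rank product means a sample is ranked as strongly outlying by all methods at once; a sample flagged by only one method ends up with a larger value.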

    The role of network science in glioblastoma

    Network science has long been recognized as a well-established discipline across many biological domains. In the particular case of cancer genomics, network discovery is challenged by the multitude of available high-dimensional heterogeneous views of data. Glioblastoma (GBM) is an example of such a complex and heterogeneous disease that can be tackled by network science. Identifying the architecture of molecular GBM networks is essential to understanding the information flow and better informing drug development and pre-clinical studies. Here, we review network-based strategies that have been used in the study of GBM, along with the available software implementations for reproducibility and further testing on newly coming datasets. Promising results have been obtained from both bulk and single-cell GBM data, placing network discovery at the forefront of developing a molecularly-informed-based personalized medicine.
    Funding: This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references CEECINST/00102/2018, CEECIND/00072/2018 and PD/BDE/143154/2019, UIDB/04516/2020, UIDB/00297/2020, UIDB/50021/2020, UIDB/50022/2020, UIDB/50026/2020, UIDP/50026/2020, NORTE-01-0145-FEDER-000013, and NORTE-01-0145-FEDER000023 and projects PTDC/CCI-BIO/4180/2020 and DSAIPA/DS/0026/2019. This project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 951970 (OLISSIPO project).

    Forecasting biotoxin contamination in mussels across production areas of the Portuguese coast with Artificial Neural Networks

    Harmful algal blooms (HABs) and the consequent contamination of shellfish are complex processes depending on several biotic and abiotic variables, turning prediction of shellfish contamination into a challenging task. Not only is the information of interest dispersed among multiple sources, but the complex temporal relationships between the time-series variables also require advanced machine learning methods to model them. In this study, multiple time-series variables measured in Portuguese shellfish production areas were used to forecast shellfish contamination by diarrhetic shellfish poisoning (DSP) toxins one to four weeks in advance. These time series included DSP concentration in mussels (Mytilus galloprovincialis), toxic phytoplankton cell counts, meteorological, and remotely sensed oceanographic variables. Several data pre-processing and feature engineering methods were tested, as well as multiple autoregressive and artificial neural network (ANN) models. The best results regarding the mean absolute error of prediction were obtained for a bivariate long short-term memory (LSTM) neural network based on biotoxin and toxic phytoplankton measurements, with higher accuracy for short-term forecasting horizons. When evaluating the ability of all ANN models to predict the contamination state (below or above the regulatory limit for contamination) and changes to this state, multilayer perceptrons (MLP) and convolutional neural networks (CNN) yielded improved predictive performance on a case-by-case basis. These results show the possibility of extracting relevant information from time-series data from multiple sources that is predictive of DSP contamination in mussels, therefore placing ANNs as good candidate models to assist the production sector in anticipating harvesting interdictions and mitigating economic losses.
    © 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

    A review of recent machine learning advances for forecasting harmful Algal Blooms and shellfish contamination

    Harmful algal blooms (HABs) are among the most severe ecological marine problems worldwide. Under favorable climate and oceanographic conditions, toxin-producing microalgae species may proliferate, reach increasingly high cell concentrations in seawater, accumulate in shellfish, and threaten the health of seafood consumers. There is an urgent need for the development of effective tools to help shellfish farmers cope with and anticipate HAB events and shellfish contamination, which frequently lead to significant negative economic impacts. Statistical and machine learning forecasting tools have been developed in an attempt to better inform the shellfish industry, so as to limit damages, improve mitigation measures and reduce production losses. This study presents a synoptic review covering the trends in machine learning methods for predicting HABs and shellfish biotoxin contamination, with a particular focus on autoregressive models, support vector machines, random forest, probabilistic graphical models, and artificial neural networks (ANN). Most efforts have attempted to forecast HABs with models of increasing complexity over the years, coupled with increased multi-source data availability, with ANN architectures at the forefront of modeling these events. The purpose of this review is to help define machine learning-based strategies to support the shellfish industry in managing harvesting and production, and to support decision making by governmental agencies with environmental responsibilities.
    Funding references: CEECINST/00102/2018; UIDB/04516/2020; UIDB/00297/2020; UIDB/50021/2020; UID/Multi/04326/2020.

    Role of Src homology domain binding in signaling complexes assembled by the murid γ-herpesvirus M2 protein

    γ-Herpesviruses express proteins that modulate B lymphocyte signaling to achieve persistent latent infections. One such protein is the M2 latency-associated protein encoded by the murid herpesvirus-4. M2 has two closely spaced tyrosine residues, Tyr120 and Tyr129, which are phosphorylated by Src family tyrosine kinases. Here we used mass spectrometry to identify the binding partners of tyrosine-phosphorylated M2. Each M2 phosphomotif is shown to bind directly and selectively to SH2-containing signaling molecules. Specifically, Src family kinases, NCK1 and Vav1, bound to the Tyr(P)120 site, PLCγ2 and the SHP2 phosphatase bound to the Tyr(P)129 motif, and the p85α subunit of PI3K associated with either motif. Consistent with these data, we show that M2 coordinates the formation of multiprotein complexes with these proteins. The effect of those interactions is functionally bivalent, because it can result in either the phosphorylation of a subset of binding proteins (Vav1 and PLCγ2) or in the inactivation of downstream targets (AKT). Finally, we show that translocation to the plasma membrane and subsequent M2 tyrosine phosphorylation relies on the integrity of a C-terminal proline-rich SH3 binding region of M2 and its interaction with Src family kinases. Unlike other γ-herpesviruses, which encode transmembrane proteins that mimic the activation of ITAMs, murid herpesvirus-4 perturbs B cell signaling using a cytoplasmic/membrane shuttling factor that nucleates the assembly of signaling complexes using a bilayered mechanism of phosphotyrosine and proline-rich anchoring motifs.
    Funding: This work was supported by Portuguese Fundação para a Ciência e Tecnologia (FCT) Grant PTDC/SAU-MII/099314/2008 (to J. P. S.) and Spanish Association Against Cancer and Spanish Ministry of Economy and Competitiveness Grants SAF2009-07172 and RD06/0020/0001, respectively (to X. R. B.). Spanish funding is co-sponsored by the European FEDER program. The SPR equipment at the Instituto de Tecnologia Química e Biológica was acquired with FCT Grant PNRC/692/BIO/2264/2005.